Writing Data Frame / Plotting Functions

Wednesday, May 15

Today we will…

  • Discuss Group Project + Group Contract
  • New Material
    • Calling Functions on Datasets
    • Thinking About Missing Data
  • Lab 7: Functions and Fish

Group Project Details

Check out the Canvas page outlining the group project!


  • Groups have been assigned.
  • Your group contract is due on Monday!

Calling Functions on Datasets

Last Time…

We wrote a function called find_car_make() that takes in the name of a car and returns the “make” of the car (the company that created it).

  • find_car_make("Toyota Camry") returns “Toyota”.
  • find_car_make("Ford Anglia") returns “Ford”.
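
# str_extract() comes from the stringr package
library(stringr)

find_car_make <- function(car_name){
  # The first run of letters in the name is treated as the make
  make <- str_extract(string = car_name, 
                      pattern = "[:alpha:]*")
  return(make)
}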

Pair Our Function with dplyr

Consider the mtcars data.

data(mtcars)
head(mtcars, n = 3)
               mpg cyl disp  hp drat    wt  qsec vs am gear carb
Mazda RX4     21.0   6  160 110 3.90 2.620 16.46  0  1    4    4
Mazda RX4 Wag 21.0   6  160 110 3.90 2.875 17.02  0  1    4    4
Datsun 710    22.8   4  108  93 3.85 2.320 18.61  1  1    4    1

. . .

Let’s use our new function:

mtcars |> 
  rownames_to_column("make_model") |> 
  mutate(make = find_car_make(make_model),
         .after = make_model) |> 
  head(n = 3)
     make_model   make  mpg cyl disp  hp drat    wt  qsec vs am gear carb
1     Mazda RX4  Mazda 21.0   6  160 110 3.90 2.620 16.46  0  1    4    4
2 Mazda RX4 Wag  Mazda 21.0   6  160 110 3.90 2.875 17.02  0  1    4    4
3    Datsun 710 Datsun 22.8   4  108  93 3.85 2.320 18.61  1  1    4    1

Recall the penguins Data

library(palmerpenguins)
data(penguins)
penguins |> 
  head()
# A tibble: 6 × 8
  species island    bill_length_mm bill_depth_mm flipper_length_mm body_mass_g
  <fct>   <fct>              <dbl>         <dbl>             <int>       <int>
1 Adelie  Torgersen           39.1          18.7               181        3750
2 Adelie  Torgersen           39.5          17.4               186        3800
3 Adelie  Torgersen           40.3          18                 195        3250
4 Adelie  Torgersen           NA            NA                  NA          NA
5 Adelie  Torgersen           36.7          19.3               193        3450
6 Adelie  Torgersen           39.3          20.6               190        3650
# ℹ 2 more variables: sex <fct>, year <int>

Function to Standardize Data

We want to take in a vector of numbers and standardize it: rescale the values so that the smallest becomes 0 and the largest becomes 1.

. . .

std_to_01 <- function(var) {
  stopifnot(is.numeric(var))
  
  num <- var - min(var, na.rm = TRUE)
  denom <- max(var, na.rm = TRUE) - min(var, na.rm = TRUE)
  
  return(num / denom)
}
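
As a quick sanity check, here is a minimal sketch that runs std_to_01() on a small made-up vector (the numbers are illustrative only, not from the penguins data). The function is repeated so the example runs on its own.

```r
# Min-max rescaling, as defined above: (x - min) / (max - min)
std_to_01 <- function(var) {
  stopifnot(is.numeric(var))

  num <- var - min(var, na.rm = TRUE)
  denom <- max(var, na.rm = TRUE) - min(var, na.rm = TRUE)

  return(num / denom)
}

x <- c(2, 4, NA, 10)   # made-up values, including one missing value
scaled <- std_to_01(x)
scaled
#> [1] 0.00 0.25   NA 1.00
```

The missing value stays missing because na.rm = TRUE only affects min() and max(), not the subtraction. Two edge cases are worth knowing: a non-numeric input stops with an error from stopifnot(), and a vector whose values are all equal returns NaN, since the denominator is 0.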

Standardizing Variables

Is it a good idea to standardize (scale) variables in a data analysis?

Why standardize?

  • Easier to compare across variables.
  • Easier to model: puts all variables on a common scale.

Why not standardize?

  • More difficult to interpret the values.

. . .

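E.g., a penguin with a bill length of 35 mm (standardized to 0.11) and a body mass of 5500 g (standardized to 0.78). The raw values are in familiar units; the standardized values only tell us where the penguin falls between the smallest and largest observed values.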

Pair Our Function with dplyr

Let’s standardize penguin measurements.

penguins |> 
  mutate(bill_length_mm    = std_to_01(bill_length_mm), 
         bill_depth_mm     = std_to_01(bill_depth_mm), 
         flipper_length_mm = std_to_01(flipper_length_mm), 
         body_mass_g       = std_to_01(body_mass_g))

  • Ugh. Still copy-pasting!

. . .

Recall across()!

penguins |> 
  mutate(across(.cols = bill_length_mm:body_mass_g,
                .fns = ~ std_to_01(.x))) |> 
  slice_head(n = 4)
# A tibble: 4 × 8
  species island    bill_length_mm bill_depth_mm flipper_length_mm body_mass_g
  <fct>   <fct>              <dbl>         <dbl>             <dbl>       <dbl>
1 Adelie  Torgersen          0.255         0.667             0.153       0.292
2 Adelie  Torgersen          0.269         0.512             0.237       0.306
3 Adelie  Torgersen          0.298         0.583             0.390       0.153
4 Adelie  Torgersen         NA            NA                NA          NA    
# ℹ 2 more variables: sex <fct>, year <int>

Use variables as function arguments?

std_column_01 <- function(data, variable) {
  stopifnot(is.data.frame(data))
  
  data <- data |> 
    mutate(variable = std_to_01(variable))
  return(data)
}

Note

I used the existing function std_to_01() inside the new function for clarity!

. . .

But it didn’t work…

std_column_01(penguins, body_mass_g)
Error in `mutate()`:
ℹ In argument: `variable = std_to_01(variable)`.
Caused by error:
! object 'body_mass_g' not found

Tidy Evaluation

Functions that take unquoted variable names as arguments (body_mass_g rather than "body_mass_g") are said to use nonstandard evaluation; the tidyverse's version of this is called tidy evaluation.

Tidy:

penguins |> 
  pull(body_mass_g)

  OR

penguins$body_mass_g

Untidy:

penguins[, "body_mass_g"]

  OR

penguins[["body_mass_g"]]


. . .

Tidy evaluation doesn’t work automatically when you wrap tidyverse code in your own functions.

Defused R Code

When a piece of code is defused, R doesn’t return its value like normal.

  • Instead it returns an expression that describes how to evaluate it.

. . .

Evaluated code:

1 + 1
[1] 2

Defused code:

expr(1 + 1)
1 + 1

. . .

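Tidy evaluation functions like mutate() defuse their arguments. When we wrap them in our own functions, the column name we pass in is not forwarded automatically, so our functions don’t know how to handle it.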
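
To make the evaluated / defused distinction concrete, here is a small sketch using rlang. The helper show_defused() is hypothetical and exists only for illustration.

```r
library(rlang)

# Defuse an expression: nothing is computed yet.
e <- expr(1 + 1)
class(e)   # "call": an object describing the computation
eval(e)    # evaluating the defused expression gives 2

# Inside a function, enquo() defuses whatever the caller typed for an argument.
show_defused <- function(arg) {
  q <- enquo(arg)
  as_label(q)   # the caller's code, as text
}

# Works even though no object named body_mass_g exists here,
# because the argument is captured rather than evaluated.
show_defused(body_mass_g)
#> [1] "body_mass_g"
```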

Solution 1

Don’t use tidy evaluation in your own functions.

  • This is more complicated to read and use, but it’s safe.

std_column_01 <- function(data, variable) {
  stopifnot(is.data.frame(data))
  
  data[[variable]] <- std_to_01(data[[variable]])
  return(data)
}

std_column_01(penguins, "bill_length_mm")
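
One upside of the string-based version is that column names are ordinary character values, so they can be stored in a vector and looped over. A minimal sketch, assuming a small made-up data frame (toy and its columns are hypothetical), with both functions repeated so it runs on its own:

```r
std_to_01 <- function(var) {
  stopifnot(is.numeric(var))
  num <- var - min(var, na.rm = TRUE)
  denom <- max(var, na.rm = TRUE) - min(var, na.rm = TRUE)
  return(num / denom)
}

std_column_01 <- function(data, variable) {
  stopifnot(is.data.frame(data))
  data[[variable]] <- std_to_01(data[[variable]])
  return(data)
}

# Made-up measurements for illustration
toy <- data.frame(id = 1:3, length = c(10, 20, 30), mass = c(5, 5.5, 7))

# Column names are plain strings, so we can loop over them.
cols_to_scale <- c("length", "mass")
for (col in cols_to_scale) {
  toy <- std_column_01(toy, col)
}

toy$length
#> [1] 0.0 0.5 1.0
```

The id column is untouched because it was never named in cols_to_scale.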

Solution 2: rlang

Use the rlang package!

  • This package provides operators that simplify writing functions around tidyverse pipelines.

  • Read more about using this package for function writing here!

Solution 2: rlang

Two ways to get around the issue of defused code:

  1. Embrace Operator ({{ }})
  • With {{ }}, you can transport a variable from one function to another.

. . .

  2. Defuse and Inject
  • You can first use enquo(arg) to defuse the variable.
  • Then use !!arg to inject the variable.

Solution 2: rlang

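If we use either of these solutions to name a column (on the left-hand side of =), we also need to use the walrus operator (:=).

  • R doesn’t allow {{ }} or !! on the left of =, so we have to use := instead of = whenever the name being assigned in a dplyr verb comes from one of these rlang fixes.
  • On the right-hand side, or in an argument like .cols, the usual = is fine.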
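
A minimal sketch of when := is and isn't needed. The helpers add_std_column() and add_centered() and the toy data are hypothetical, written only to illustrate the rule.

```r
library(dplyr)

std_to_01 <- function(var) {
  stopifnot(is.numeric(var))
  (var - min(var, na.rm = TRUE)) / (max(var, na.rm = TRUE) - min(var, na.rm = TRUE))
}

# Hypothetical helper: keeps the original column and adds "<column>_std".
add_std_column <- function(data, variable) {
  data |>
    # The embraced variable is part of the new column's NAME, so := is required.
    mutate("{{ variable }}_std" := std_to_01({{ variable }}))
}

# Hypothetical helper: the embraced variable appears only on the right, so = works.
add_centered <- function(data, variable) {
  data |>
    mutate(centered = {{ variable }} - mean({{ variable }}, na.rm = TRUE))
}

toy <- tibble(mass = c(2, 4, 10))   # made-up values
result <- toy |>
  add_std_column(mass) |>
  add_centered(mass)

names(result)
#> [1] "mass"     "mass_std" "centered"
```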

Recall Our Broken Function

std_column_01 <- function(data, variable) {
  stopifnot(is.data.frame(data))
  
  data <- data |> 
    mutate(variable = std_to_01(variable))
  return(data)
}

std_column_01(penguins, body_mass_g)
Error in `mutate()`:
ℹ In argument: `variable = std_to_01(variable)`.
Caused by error:
! object 'body_mass_g' not found
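
  • mutate() treats variable literally rather than as a stand-in for body_mass_g, so R ends up looking for body_mass_g outside the data frame and can’t find it.
  • We need to modify how variable is used to make this work correctly!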

Fixing Our Broken Function

std_column_01 <- function(data, variable) {
  stopifnot(is.data.frame(data))

  data <- data |>
    mutate({{variable}} := std_to_01({{variable}}))
  return(data)
}

std_column_01(penguins, body_mass_g)
# A tibble: 6 × 7
  species island    bill_length_mm bill_depth_mm body_mass_g sex     year
  <fct>   <fct>              <dbl>         <dbl>       <dbl> <fct>  <int>
1 Adelie  Torgersen           39.1          18.7       0.292 male    2007
2 Adelie  Torgersen           39.5          17.4       0.306 female  2007
3 Adelie  Torgersen           40.3          18         0.153 female  2007
4 Adelie  Torgersen           NA            NA        NA     <NA>    2007
5 Adelie  Torgersen           36.7          19.3       0.208 female  2007
6 Adelie  Torgersen           39.3          20.6       0.264 male    2007

Or, defuse and inject:

std_column_01 <- function(data, variable) {
  stopifnot(is.data.frame(data))
  
  variable <- enquo(variable)

  data <- data |>
    mutate(!!variable := std_to_01(!!variable))
  return(data)
}

std_column_01(penguins, body_mass_g)
# A tibble: 6 × 7
  species island    bill_length_mm bill_depth_mm body_mass_g sex     year
  <fct>   <fct>              <dbl>         <dbl>       <dbl> <fct>  <int>
1 Adelie  Torgersen           39.1          18.7       0.292 male    2007
2 Adelie  Torgersen           39.5          17.4       0.306 female  2007
3 Adelie  Torgersen           40.3          18         0.153 female  2007
4 Adelie  Torgersen           NA            NA        NA     <NA>    2007
5 Adelie  Torgersen           36.7          19.3       0.208 female  2007
6 Adelie  Torgersen           39.3          20.6       0.264 male    2007

Inject Multiple Variables

What if I want to modify multiple columns?

  • Use across()!

std_column_01 <- function(data, variables) {
  stopifnot(is.data.frame(data))
  
  data <- data |> 
    mutate(across(.cols = {{variables}},
                  .fns = ~ std_to_01(.x)))
  return(data)
}

std_column_01(penguins, bill_length_mm:body_mass_g)
# A tibble: 5 × 7
  species island    bill_length_mm bill_depth_mm body_mass_g sex     year
  <fct>   <fct>              <dbl>         <dbl>       <dbl> <fct>  <int>
1 Adelie  Torgersen          0.255         0.667       0.292 male    2007
2 Adelie  Torgersen          0.269         0.512       0.306 female  2007
3 Adelie  Torgersen          0.298         0.583       0.153 female  2007
4 Adelie  Torgersen         NA            NA          NA     <NA>    2007
5 Adelie  Torgersen          0.167         0.738       0.208 female  2007
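
Because {{variables}} is forwarded to the .cols argument of across(), the caller can use any column selection that across() accepts, not just a range. A minimal sketch on a small made-up tibble (toy and its columns are hypothetical), with the functions repeated so it runs on its own:

```r
library(dplyr)

std_to_01 <- function(var) {
  stopifnot(is.numeric(var))
  (var - min(var, na.rm = TRUE)) / (max(var, na.rm = TRUE) - min(var, na.rm = TRUE))
}

std_column_01 <- function(data, variables) {
  stopifnot(is.data.frame(data))

  data <- data |>
    mutate(across(.cols = {{variables}},
                  .fns = ~ std_to_01(.x)))
  return(data)
}

# Made-up data for illustration
toy <- tibble(
  species = c("A", "A", "B"),
  length  = c(10, 20, 30),
  mass    = c(5, 5.5, 7)
)

by_range <- std_column_01(toy, length:mass)        # a range of columns
by_names <- std_column_01(toy, c(length, mass))    # an explicit set of columns
by_type  <- std_column_01(toy, where(is.numeric))  # every numeric column
```

All three calls rescale the same two columns here. Note that where(is.numeric) would also pick up columns like year in penguins, which is probably not what we want.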

Missing Data

Types of Missing Data

  1. Missing Completely at Random (MCAR)
    • No difference between missing and observed values.
    • Missing observations are a random subset of all observations.
  2. Missing at Random (MAR)
    • Systematic difference between missing and observed values, but it can be entirely explained by other observed variables.
  3. Missing Not at Random (MNAR)
    • Missingness is directly related to the unobserved value.

Types of Missing Data

Consider a study of depression.

  1. Missing Completely at Random (MCAR)
    • Some subjects have missing lab values because a batch of samples was processed improperly.
  2. Missing at Random (MAR)
    • Subjects who identify as men are less likely to complete a survey on depression severity.
  3. Missing Not at Random (MNAR)
    • Subjects with more severe depression are less likely to complete a survey on depression severity.

When we remove missing data…

We implicitly assume observations are missing completely at random!

  • We might be mostly removing data from subjects who identify as men.
  • We might be mostly removing data from subjects with severe depression.
  • We are inadvertently making our data less representative.

. . .

We need to take more care when dealing with missing values!
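
A minimal sketch of how dropping incomplete rows can shift who is represented. The survey data below are made up for illustration, with scores missing mostly for men.

```r
library(dplyr)

# Made-up survey data: scores are missing mostly for men.
survey <- tibble(
  gender = c("man", "man", "man", "man", "woman", "woman", "woman", "woman"),
  score  = c(NA, NA, NA, 12, 9, 15, NA, 11)
)

# Share of each gender before dropping incomplete rows
before <- survey |>
  count(gender) |>
  mutate(prop = n / sum(n))

# Share of each gender after dropping rows with a missing score
after <- survey |>
  filter(!is.na(score)) |>
  count(gender) |>
  mutate(prop = n / sum(n))

before$prop   # men and women are each half of the sample
after$prop    # men are now a quarter of the remaining rows
```

In this toy example the sample starts evenly split, but after removing rows with missing scores, men make up only 1 of the 4 remaining observations.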

Dealing with Missing Data

  • Look for patterns!
    • Do observations with missing values have similar traits?

. . .

  • Consider outside explanations!
    • Why might missing data exist?
    • Should we have a “missing” category in our analysis?
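
A minimal sketch of both ideas: summarizing missingness by group to look for patterns, and keeping rows with an unknown category by labeling them explicitly. The data and column names are made up for illustration.

```r
library(dplyr)

# Made-up data for illustration
toy <- tibble(
  island = c("X", "X", "X", "Y", "Y", "Y"),
  sex    = c("female", NA, "male", NA, NA, "male"),
  mass   = c(3500, NA, 4100, 3900, NA, 4300)
)

# How much is missing, and does it differ by group?
missing_by_island <- toy |>
  group_by(island) |>
  summarise(
    n             = n(),
    n_miss_mass   = sum(is.na(mass)),
    prop_miss_sex = mean(is.na(sex))
  )

# Keep rows with unknown sex by giving them an explicit "missing" category
toy_labeled <- toy |>
  mutate(sex = if_else(is.na(sex), "missing", sex))

missing_by_island
count(toy_labeled, sex)
```

If the proportion missing differs noticeably between groups, that is a hint the data are not missing completely at random.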

. . .

  • Can we impute values?
    • If depression is MCAR within gender, age, and education level, then the distribution of depression will be similar for people of the same gender, age, and education level.
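
One very simple way to act on this idea is to fill each missing value with the mean of the observed values in the same group. The sketch below uses made-up data and a single grouping variable; it is an illustration of the idea, not a recommended imputation method.

```r
library(dplyr)

# Made-up survey scores for illustration
toy <- tibble(
  gender = c("man", "man", "man", "woman", "woman", "woman"),
  score  = c(10, NA, 14, 6, 8, NA)
)

imputed <- toy |>
  group_by(gender) |>
  mutate(
    # Record which values were imputed before overwriting them
    score_was_missing = is.na(score),
    # Replace each missing score with the mean of its own group
    score = if_else(is.na(score), mean(score, na.rm = TRUE), score)
  ) |>
  ungroup()

imputed$score
```

Keeping the score_was_missing flag lets later analyses tell imputed values apart from observed ones. Filling in group means makes the data look less variable than they really are, so treat this as a starting point only.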